Errors due to hardware or low-level software problems, if detected, can be fixed by various schemes, such as recomputation from a checkpoint. Silent errors are errors in application state that have escaped low-level error detection. At extreme scale, where machines can perform astronomically many operations per second, silent errors threaten the validity of computed results. We propose a new paradigm for detecting silent errors at the application level. Our central idea is to frequently compare computed values to those provided by a cheap checking computation, and to build error detectors based on the difference between the two output sequences. Numerical analysis provides us with usable checking computations for the solution of initial-value problems in ODEs and PDEs, arguably the most common problems in computational science. Here, we provide, optimize, and test methods based on Runge-Kutta and linear multistep methods for ODEs, and on implicit and explicit finite difference schemes for PDEs. We take the heat equation and Navier-Stokes equations as examples. In tests with artificially injected errors, this approach effectively detects almost all meaningful errors, without significant slowdown.
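As an illustration of the central idea, the following minimal sketch pairs a classical fourth-order Runge-Kutta step with a cheap two-step Adams-Bashforth prediction that reuses stored derivative values, and flags a step when the difference between the two jumps well above its recent level. This is not the detector developed in the paper: the choice of checking scheme, the threshold rule (`factor` times the last accepted difference), the recovery by simple recomputation, and all function and parameter names are assumptions made for the example.

```python
import math


def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def ab2_check(y, f_curr, f_prev, h):
    """Cheap check: two-step Adams-Bashforth prediction from stored derivatives."""
    return y + h * (1.5 * f_curr - 0.5 * f_prev)


def integrate_with_detector(f, t0, y0, h, n_steps,
                            inject_at=None, inject_size=0.0, factor=10.0):
    """Integrate with RK4 and flag steps whose RK4/AB2 difference jumps.

    inject_at and inject_size simulate a silent error (hypothetical test
    hook). Returns the final value and the list of flagged step indices.
    """
    t, y = t0, y0
    f_prev = None      # derivative at the previous time level
    baseline = None    # last accepted difference between the two schemes
    flagged = []
    for step in range(n_steps):
        f_curr = f(t, y)
        y_next = rk4_step(f, t, y, h)
        if step == inject_at:
            y_next += inject_size  # corrupt the state after the step
        if f_prev is not None:
            d = abs(y_next - ab2_check(y, f_curr, f_prev, h))
            # The floor avoids a zero threshold when the two schemes agree exactly.
            if baseline is not None and d > factor * max(baseline, 1e-14):
                flagged.append(step)
                y_next = rk4_step(f, t, y, h)  # assumed recovery: redo the step
            else:
                baseline = d
        f_prev = f_curr
        t, y = t + h, y_next
    return y, flagged


if __name__ == "__main__":
    decay = lambda t, y: -y  # test problem y' = -y, y(0) = 1
    y_end, flagged = integrate_with_detector(
        decay, 0.0, 1.0, 0.01, 100, inject_at=50, inject_size=1e-3)
    print("flagged steps:", flagged)
    print("error at t = 1:", abs(y_end - math.exp(-1.0)))
```

For a smooth solution the difference between the two schemes is dominated by the local error of the lower-order check and varies slowly from step to step, so a sudden jump suggests corrupted state rather than ordinary truncation error. The check costs no extra evaluations of `f`. In this sketch the first two steps only establish the baseline, so an error injected there would go undetected.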